In this tutorial, we’ll see how to use R to get information from the Web using a public API. This markdown document was built using the {rmdformats} package. You can download the .Rmd file here.
What is an API?
In the previous tutorial, we saw how to scrape data in a way that essentially mimicked what a human user would do: we went to a URL, identified the information we wanted by parsing the HTML codes and using CSS selectors, then we “copied” that information into a dataframe in R or into separate text files. This method is a common approach to scraping, but it is not the only, or even the most efficient, method. APIs are another very common way to access and acquire data from the Web.
Instead of downloading a dataset or scraping a site, APIs allow you to request data directly from a website through what’s called an Application Programming Interface. Many large sites like Reddit, Twitter, Spotify, and Facebook provide APIs so that data analysts and scientists can access data quickly, reliably, and legally. This last bit is important. Always check if a website has an API before scraping by other means. The following brief explanation is adapted from this post at dataquest.io.
‘API’ is a general term for the place where one computer program interacts with another, or with itself. We will be working with web APIs here, where two different computers—a client and server—interact with each other to request and provide data, respectively. APIs provide a way for us to request clean and curated data from a website. When a website sets up an API, they are essentially setting up a computer that waits for data requests from other users.
Once this computer receives a data request, it will do its own processing of the data and send it to the computer that requested it. From our perspective as the requester, we will need to write code in R that creates the request and tells the computer running the API what we want. That computer will then read our code, process the request, and return nicely-formatted data that we can work with in existing R libraries.
Why is this valuable? Contrast the API approach to “pure” webscraping that we used in the previous tutorial. When a programmer scrapes a web page, they receive the data in a messy chunk of HTML. While we were able to use libraries, e.g. {rvest}, to make parsing HTML text easier, we still had to go through multiple steps to identify the page URLs, and the correct bits of HTML and CSS to give us what we wanted. This wasn’t too hard with our toy examples, but it can often be quite complicated.
APIs offer a way to get data that we can immediately use, which can save us a lot of time and frustration. Many commonly used sites have R packages that are specifically dedicated to interfacing with those sites’ APIs. {rtweet} is such a library for getting tweets from Twitter’s API. Other examples include {RedditExtractoR}, {twitteR}, {Rfacebook}, {geniusr}, and {spotifyr}. Otherwise, you can use the {httr} and {jsonlite} packages to work with APIs more generally. These are a bit more advanced, and we will only briefly go into these at the end of this session (but see here and here for an introduction).
Getting started
R libraries
Libraries we’ll be using:
library(tidyverse) # for data wrangling
library(tictoc) # for timing processes
library(here) # for creating consistent file paths
library(usethis) # for editing environment filesWe’ll also be making extensive use of the {tidytext} package, which you can find more info about in the Text Mining with R book online.
In the following section we’ll take a quick look at a few different packages for interfacing with specific APIs. These packages are
# R libraries for interfacing with APIs
library(RedditExtractoR) # for scraping reddit forums
library(rtweet) # for getting tweets
library(httr) # for interfacing with APIs in general
library(jsonlite) # for parsing JSONWorking with API wrapper packages
We’ll start with reddit. There’s a lot you could do with reddit data, and fortunately the {RedditExtractoR} package makes it very easy to get data. Using it is very simple with the get_reddit() function.
For example, suppose we want to see what people are saying about Spinosaurus, a genus of dinosaur that turns out to have been even cooler and more unusual than we thought.
Spinosaurus aegyptiacus. Image source: https://www.nhm.ac.uk/discover/news/2020/may/dinosaur-diaries-spinosaurus-sauropod-necks-starry-lizard.html
The get_reddit() function returns a dataframe of comment threads and other information about a given search term. We’ll use the simple search term “spinosaurus,” and limit the results to the subreddit “Dinosaurs.” This will give us all the comment threads matching these criteria.
tic()
spinosaur_threads <- get_reddit(
search_terms = "Spinosaur",
page_threshold = 1, # number of pages to be searched
subreddit = "Dinosaurs"
)
|
| | 0%
|
|=== | 4%
|
|====== | 8%
|
|======== | 12%
|
|=========== | 16%
|
|============== | 20%
|
|================= | 24%
|
|==================== | 28%
|
|====================== | 32%
|
|========================= | 36%
|
|============================ | 40%
|
|=============================== | 44%
|
|================================== | 48%
|
|==================================== | 52%
|
|======================================= | 56%
|
|========================================== | 60%
|
|============================================= | 64%
|
|================================================ | 68%
|
|================================================== | 72%
|
|===================================================== | 76%
|
|======================================================== | 80%
|
|=========================================================== | 84%
|
|============================================================== | 88%
|
|================================================================ | 92%
|
|=================================================================== | 96%
|
|======================================================================| 100%
toc()62.88 sec elapsed
Notice that this function can take a while to run, and it conveniently prints a progress bar to help.
Let’s see what we have in there.
spinosaur_threads %>%
glimpse()Rows: 399
Columns: 18
$ id <int> 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2,~
$ structure <chr> "1", "1_1", "2", "2_1", "1", "1_1", "2", "2_1", "1", ~
$ post_date <chr> "16-03-21", "16-03-21", "16-03-21", "16-03-21", "20-0~
$ comm_date <chr> "16-03-21", "16-03-21", "17-03-21", "17-03-21", "20-0~
$ num_comments <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,~
$ subreddit <chr> "Dinosaurs", "Dinosaurs", "Dinosaurs", "Dinosaurs", "~
$ upvote_prop <dbl> 0.75, 0.75, 0.75, 0.75, 0.74, 0.74, 0.74, 0.74, 1.00,~
$ post_score <dbl> 2, 2, 2, 2, 7, 7, 7, 7, 6, 6, 6, 6, 62, 62, 62, 62, 3~
$ author <chr> "cyanide_sunrise2002", "cyanide_sunrise2002", "cyanid~
$ user <chr> "JWraptor3", "cyanide_sunrise2002", "Birds_are_therop~
$ comment_score <dbl> 5, 1, 2, 1, 5, 4, 1, 1, 6, 1, 3, 1, 4, 8, 3, 3, 3, 1,~
$ controversiality <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
$ comment <chr> "Its anatomy does completely condridict a piscivorous~
$ title <chr> "Was Ceratosaurus piscivorous?", "Was Ceratosaurus pi~
$ post_text <chr> "Hi all,\n\nI'm opening this discussion because I rec~
$ link <chr> "https://www.reddit.com/r/Dinosaurs/comments/m6c65f/w~
$ domain <chr> "self.Dinosaurs", "self.Dinosaurs", "self.Dinosaurs",~
$ URL <chr> "http://www.reddit.com/r/Dinosaurs/comments/m6c65f/wa~
If we arrange the comments by thread title, we can look at the comments and authors. (as usual, scroll through columns with the little arrow in the upper right)
spinosaur_threads %>%
arrange(title) %>%
select(num_comments, title, author, comment) %>%
arrange(desc(num_comments)) # put most popular thread firstThis is cool. It’s clear that the text might need some cleaning up, but the data is all there. Super simple!
There are a couple more things to mention about this package. The first is that we can easily graph the comment chains contained in a given thread. This tells us which comments are replied to, and how is replying to them. For example we have the thread titled “[Question] Questions about the Spinosaurs.” authored by user A_Charmandur, which has 17 total comments.
spinosaur_threads %>%
dplyr::filter(title == "[Question] Questions about the Spinosaurs.") %>%
select(title, num_comments, user, author)We can graph the comment chains with construct_graph() like so.
thread_chain <- spinosaur_threads %>%
dplyr::filter(title == "[Question] Questions about the Spinosaurs.") %>%
construct_graph(plot = TRUE)Alternatively, we can graph how the users interacted with one another in a given thread, aggregating over their comments (thicker lines indicate more frequent interactions). Here we use the user_network() function.
thread_network <- spinosaur_threads %>%
dplyr::filter(title == "[Question] Questions about the Spinosaurs.") %>%
user_network(include_author = TRUE, agg = TRUE)
thread_network$plotHow you might use this information is open ended, but this seems like a very useful too for studying language in use.
Now let’s look at Twitter. I like the {rtweet} package for getting tweets very easily. But in order to use it, you need a Twitter account so you can authorize {rtweet} to use your specific account credentials. This is due to the fact that there are limits to how many tweets you can download in a given time period.
By far the easiest way to do this is to simply request some tweets with the search_tweets() function. If you don’t have any credentials stored on your system, a browser window should open asking you to authorize the request the first time you run a search. Once you do this, an authorization token will now be stored in your .Renviron file on your system so you don’t have to re-authorize in this session (if you close R it will reauthenticate). You can access this file in several ways, but I like the edit_r_environ() function in the {usethis} package.
usethis::edit_r_environ() # open the .Renviron fileThis file contains environmental variables that R will use for various applications. Do not change anything in this now! We’ll come back to this file later on.
In this example, let’s look for tweets using the word cheugy, which has been a major topic of discussion lately (see e.g. articles in The Guardian, The Telegraph, Vox, and The New York Times). If you’re not familiar with the term, Urban Dictionary defines it as
Another way to describe aesthetics/people/experiences that are basic. It was coined by a now 23 year old white woman in 2013 while a student at Beverly Hills High School, on whom the irony is apparently lost. According to the New York Times, “cheugy (pronounced chew-gee) can be used, broadly, to describe someone who is out of date or trying too hard.”
So this is not a brand new word, but it’s new enough for most people to find interesting. More important, it’s a perfect example of how fast language can change, and Twitter, perhaps more than any other source, can help us investigate such rapid (and maybe fleeting) changes in the language of social media.
Search tweets
Here we’ll use the search_tweets() function in {rtweet}, which takes a search term and returns a dataframe. The function returns tweets from all languages, so we’ll make sure to include "lang:en" in our search term to limit our searches to English. The normal limit is 18k tweets in 15 minute span, but to keep it simple, we’ll just do the most recent 200 tweets with this word in them (note this may get different results each time you run it).
cheugy_tweets <- rtweet::search_tweets(
"cheugy lang:en", # the terms to search for
n = 200, # the number of tweets to collect
include_rts = FALSE # don't include retweets
)Now what do we get…
names(cheugy_tweets) [1] "user_id" "status_id"
[3] "created_at" "screen_name"
[5] "text" "source"
[7] "display_text_width" "reply_to_status_id"
[9] "reply_to_user_id" "reply_to_screen_name"
[11] "is_quote" "is_retweet"
[13] "favorite_count" "retweet_count"
[15] "quote_count" "reply_count"
[17] "hashtags" "symbols"
[19] "urls_url" "urls_t.co"
[21] "urls_expanded_url" "media_url"
[23] "media_t.co" "media_expanded_url"
[25] "media_type" "ext_media_url"
[27] "ext_media_t.co" "ext_media_expanded_url"
[29] "ext_media_type" "mentions_user_id"
[31] "mentions_screen_name" "lang"
[33] "quoted_status_id" "quoted_text"
[35] "quoted_created_at" "quoted_source"
[37] "quoted_favorite_count" "quoted_retweet_count"
[39] "quoted_user_id" "quoted_screen_name"
[41] "quoted_name" "quoted_followers_count"
[43] "quoted_friends_count" "quoted_statuses_count"
[45] "quoted_location" "quoted_description"
[47] "quoted_verified" "retweet_status_id"
[49] "retweet_text" "retweet_created_at"
[51] "retweet_source" "retweet_favorite_count"
[53] "retweet_retweet_count" "retweet_user_id"
[55] "retweet_screen_name" "retweet_name"
[57] "retweet_followers_count" "retweet_friends_count"
[59] "retweet_statuses_count" "retweet_location"
[61] "retweet_description" "retweet_verified"
[63] "place_url" "place_name"
[65] "place_full_name" "place_type"
[67] "country" "country_code"
[69] "geo_coords" "coords_coords"
[71] "bbox_coords" "status_url"
[73] "name" "location"
[75] "description" "url"
[77] "protected" "followers_count"
[79] "friends_count" "listed_count"
[81] "statuses_count" "favourites_count"
[83] "account_created_at" "verified"
[85] "profile_url" "profile_expanded_url"
[87] "account_lang" "profile_banner_url"
[89] "profile_background_url" "profile_image_url"
There’s A LOT of information here, and you can go to ?search_tweets to see the full run down of what these columns contain. But you should be able to see some potentially useful info here. For example we can look at who tweeted (screen_name), the date and time of the tweet, and the location they supplied, if any.
cheugy_tweets %>%
select(screen_name, created_at, location) So there’s lots we could do. For example, we can see the text of the tweets in the text column.
cheugy_tweets %>%
select(text) We can see who is using this word and when.
cheugy_tweets %>%
select(screen_name, created_at) It’s worth noting that the query syntax is a bit different from other packages. From the ?search_tweets help file:
Spaces behave like boolean “AND” operator. To search for tweets containing at least one of multiple possible terms, separate each search term with spaces and “OR” (in caps). For example, the search q = “data science” looks for tweets containing both “data” and “science” located anywhere in the tweets and in any order. When “OR” is entered between search terms, query = “data OR science,” Twitter’s REST API should return any tweet that contains either “data” or “science.” It is also possible to search for exact phrases using double quotes. To do this, either wrap single quotes around a search query using double quotes, e.g., q = ‘“data science”’ or escape each internal double quote with a single backslash, e.g., q = “"data science".”
So just a warning to be careful with your searches. For example, if we wanted to look for cheugy or cheug (as in “I’m a ‘cheug’ and proud of it”) we’d need to specify the query like so.
cheugy_tweets <- rtweet::search_tweets(
q = "cheugy OR cheug lang:en", # the terms to search for
n = 200, # the number of tweets to collect
include_rts = FALSE # don't include retweets
)Stream tweets
We can randomly sample (approximately 1%) from the live stream of all tweets with stream_tweets().
rt <- stream_tweets(
q = "cheugy OR cheug lang:en",
timeout = 30 # time in seconds to stream
)Or we could stream all tweets mentioning cheugy or cheug for a week.
# stream tweets for a week (60 secs * 60 mins * 24 hours * 7 days)
# Don't RUN THIS!
stream_tweets(
q = "cheugy OR cheug lang:en",
timeout = 60 * 60 * 24 * 7,
file_name = here("data_raw", "live_cheugy_tweets.json"),
parse = FALSE
)A couple things to note here. I’ve set this to save the unparsed output to the file live_cheugy_tweets.json in my project’s data_raw folder. This is a better method for longer streams. As noted in the ?stream_tweets documentation,
By default,
parse = TRUE, this function does the parsing for you. However, for larger streams, or for automated scripts designed to continuously collect data, this should be set toFALSEas the parsing process can eat up processing resources and time. For other uses, setting parse toTRUEsaves you from having to sort and parse the messy list structure returned by Twitter.
We can easily load and parse this to a tidy dataframe with parse_stream() like so.
cheugy_tweets <- parse_stream(here("data_raw", "live_cheugy_tweets.json"))Get friends and followers
It’s also easy to track who follows and who is followed by a given user. The get_friends() function collects a list of accounts followed by a particular user.
# get user IDs of accounts followed by @Rbloggers
rblog_friends <- get_friends("@Rbloggers", n = 1000)This gives us a dataframe with a single column of user IDs. If we want more info on these users we can find it with lookup_users().
rblog_friends %>%
pull(user_id) %>% # pull out the column as a vector
lookup_users()The same applies to the get_followers() function, which gets the accounts following a user.
# get user IDs of accounts following @Rbloggers
rblog_followers <- get_followers("@Rbloggers", n = 1000)
rblog_followers %>%
pull(user_id) %>% # pull out the column as a vector
lookup_users()Both the get_friends() and get_followers() allow a maximum of 5000 for a single API call, and you are limited to 15 such calls per 15 minutes. You can see the documentation for these functions for more information.
Search users
There are lots of functions for looking into Twitter data. For example, the search_user() function gives you up to 1000 users matching a search query.
rstats_users <- search_users("#rstats", n = 100)Get timelines
You can also search for the most 3200 tweets from a single user with get_timeline(). However, there’s an added wrinkle to this function. If you try to run it like so, you’re likely to get an error.
bbc_tml <- get_timeline("@BBCNews", n = 100)The issue appears to be that the default Twitter API token used by {rtweet} will not work for this. The solution is to create your own token and specify that in the token = ... argument. If you want to use this function, you should instead use Twitter’s app authentication mechanism.
Creating a Twitter app
It’s helpful to learn a little more about what’s going on behind the scenes. The first thing to know is that every request to the Twitter API has to go through an “app.” Normally, someone else has created the app for you (it’s still called an app even though you’ll be using it through an R package).
To create a Twitter app, you need to first apply for a free developer account by following the instructions at https://developer.twitter.com. Once you have been approved (which may take some time), go to the developer portal and click the “Create App” button at the bottom of the page. You’ll need to give a name for your app (the name is unimportant for our purposes), but it needs to be unique across all twitter apps.
After you’ve named your app, you’ll see a screen that gives you some important information. These are your API key, your API secret key, and Bearer token:
Twitter API keys and bearer token
You’ll only see this once, so you need to record it in a secure location. For now you can copy these to a text file. Don’t worry though—if you don’t record these or lose them, you can always regenerate them.
Once you’ve done this, click the “App settings” button, and go to the “Keys and tokens tab.” Now as well as the API key and secret you recorded earlier, you’ll also need to generate an “Access token and secret” which you can get by clicking the “Generate” button on the next to “Access Token and Secret”:
Twitter Access keys and tokens
Again, record these somewhere secure. What I like to do is create list in R with these values, and save that to my project directory. This way I can easily load it when I start a new session.
# Your values will differ
jgtwitterapp_keylist <- list(
api_key = "INSERT YOUR KEY HERE",
api_secret = "INSERT YOUR SECRET KEY HERE",
access_token = "INSERT YOUR ACCESS TOKEN HERE",
access_secret = "INSERT YOUR ACCESS SECRET HERE",
bearer_token = "INSERT YOUR BEARER TOKEN HERE"
)
# save to file
saveRDS(jgtwitterapp_keylist, here("keys", "jgtwitterapp_keys.rds"))So when I want to use this with {rtweet}, I can just load it and create a token with create_token().
jgtwitterapp_keylist <- readRDS(here("keys", "jgtwitterapp_keys.rds"))
my_token <- create_token(
app = "jgtwitterapp", # the name of my app
consumer_key = jgtwitterapp_keylist$api_key,
consumer_secret = jgtwitterapp_keylist$api_secret,
access_token = jgtwitterapp_keylist$access_token,
access_secret = jgtwitterapp_keylist$access_secret
)Now I should be able to get the timeline for a user account.
bbc_tml <- get_timeline(
"@BBCNews",
n = 100,
token = my_token
)
bbc_tmlIt works! Annoyingly, at the moment it seems that you need to run this same create_token() process each time you start a new R session. This is a bit clunky, but it’s the only way I have found to make it work consistently. The get_token() function should be able to load your app token automatically, but I think the code is a bit buggy. You can see this thread for more discussion.
Locating tweets
We can also try to locate tweets geographically. This is very useful for doing dialectology (see e.g. herehttp://tweetolectology.com/), or for just making sure your data comes from the location you want. One issue with {rtweet} however is that it does not seem to work well for getting locations and geo coordinates for tweets. The package provides a function lookup_coords() for looking up coordinates, but this relies on getting information on Google’s API which apparently does not play well with others. From the lookup_coords() help file:
Since Google Maps implemented stricter API requirements, sending requests to Google’s API isn’t very convenient. To enable basic uses without requiring a Google Maps API key, a number of the major cities throughout the word and the following two larger locations are baked into this function: ‘world’ and ‘usa.’ If ‘world’ is supplied then a bounding box of maximum latitutde/longitude values, i.e., c(-180, -90, 180, 90), and a center point c(0, 0) are returned. If ‘usa’ is supplied then estimates of the United States’ bounding box and mid-point are returned. To specify a city, provide the city name followed by a space and then the US state abbreviation or country name. To see a list of all included cities, enter
rtweet:::citycoordsin the R console to see coordinates data.
We can see some of the cities listed in rtweet:::citycoords below (Note: I’ve found that for some reason the US cities (e.g. “new york us” or “columbus ohio us”) in this list don’t seem to be working with search_tweets. I’m not sure why this is, and can’t find any information online. The UK and other European cities do seem to work though.)
rtweet:::citycoordsSo for example, if we wanted to collect tweets from here in Birmingham, we’d use the name in the dataframe above in lookup_coords() like so.
lookup_coords("birmingham england")$place
[1] "birmingham england"
$box
sw.lng.lng sw.lat.lat ne.lng.lng ne.lat.lat
-1.966667 52.366667 -1.866667 52.466667
$point
lat lng
52.416667 -1.916667
attr(,"class")
[1] "coords" "list"
And we’ll include a geocode argument in our search to get only tweets from within these coordinates.
bham_tweets <- search_tweets(
q = "lang:en",
n = 1000,
include_rts = FALSE, # don't include retweets
geocode = lookup_coords("birmingham england")
)
bham_tweets %>%
select(screen_name, location, place_name, geo_coords)There are several sources of geographic information in our tweets:
location: This is the user-defined location for an account’s profile. This can really be anything, so you have to be careful.place_nameandplace_full_name: When users decide to assign a location to their Tweet, they are presented with a list of candidate Twitter Places, and these contain the human-readable names of those places. Of course, not all users assign a location, so this is not always useful.geo_coords,coord_coords,bbox_coords: These contain the latitude and longitude coordinates of the tweet, if available
bham_tweets %>%
count(location, sort = T)You can find the latitude and longitude of those tweets that have them (you may have to go through a few pages…)
bham_tweets <- lat_lng(bham_tweets)
bham_tweets %>%
select(lat, lng)We can even map the locations of those that have the coordinates. We’ll use the {sf} package for working with geographical boundaries, and I’ve included a file postcode-boundaries.kml which contains all the information about postcode boundaries in the UK (I got this here. We read this in with st_read()
library(sf)
# read in the geographical data
postcodes <- here("data_raw", "postcode-boundaries.kml") %>%
st_read()Reading layer `Layer Postcode_Area_polys' from data source `C:\Users\grafmilj\Dropbox\GitHub\Bham_CLSS_2021\data_raw\postcode-boundaries.kml' using driver `KML'
Simple feature collection with 124 features and 2 fields
Geometry type: MULTIPOLYGON
Dimension: XYZ
Bounding box: xmin: -8.178009 ymin: 49.16437 xmax: 1.771118 ymax: 60.84554
z_range: zmin: 0 zmax: 0
Geodetic CRS: WGS 84
Now we can plot the data in {ggplot2} with the geom_sf() layer.
uk_map <- ggplot(data = postcodes) +
geom_sf(fill = "grey95") +
theme(panel.background = element_rect(fill = "lightblue"))
uk_mapMap of postcodes in the UK
Easy! Now we can just zoom in on the Birmingham area and map our tweets.
# filter out tweets with missing coords
bham_tweets_located <- bham_tweets %>%
dplyr::filter(!is.na(lat))
# read in a small dataframe of the lat and long for a few local cities
wm_towns <- read.csv(here("data_raw", "wm_towns.csv"))
uk_map + # limit the plot to these lat (the ylim) and long (the xlim) values
coord_sf(xlim = c(-2.4, -1.5), ylim = c(52.1, 52.6), expand = FALSE) +
geom_text(
data = wm_towns,
aes(x = long, y = lat, label = city)
) +
geom_point(
data = bham_tweets_located,
aes(x = lng, y = lat),
color = "red", alpha = .4
) It’s not the best, but there is a lot more you could do with maps if you’re really keen on using geolocated data.
There’s a lot more you can do with {rtweet}, and I encourage you to check out some of the online guides available, e.g. the creator Michael Kearney’s help here and here.
Alternatively, there is the {twitteR} package, which offers some of the same functionality as {rtweet}. You can find help on how to use {twitteR} in this vignette.
Working with general APIs
Not all APIs have convenient packages dedicated to their use, so it’s likely that you may need to interface with an API directly. To do this we’ll use the {httr} package to work with Web APIs. Again, Web APIs involve two computers: a client and a server. The client submits a Hypertext Transfer Protocol (HTTP) request to the server and the server returns a response to the client. The response contains status information about the request and may also contain the requested content. The packages we’ve just seen do all this as well, but it happens behind the scenes. Now we’re going to pull the curtain back a bit and see how it works step-by-step. But this can be complicated, so I’ll just cover the very basics here.
I recommend Getting started with httr vignette for more details. The other library we’ll use is {jsonlite}, which is used for working with JSON formatted data (more on this below). Most APIs transmit data in JSON, so we’ll need to convert it into something more familiar.
library(httr)
library(jsonlite)Basic steps
In the simplest case, to make a request all you need is a URL for the API. The example I’ll use here is https://github.com/beanboi7/yomomma-apiv2, which is free site that stores “yo momma” jokes. I found this among this list of free APIs on GitHub. There are many more you can find if you poke around.
We send a request with the GET() function, along with information. If you go to the website above, it gives information about the endpoint, which is the URL we’ll use in our request, as well as any query parameters that we can set. Generally, most sites with web APIs will give you some details about how to use them, including the endpoint and other search criteria.
So we store our endpoint, and include it in our GET() request:
ym_path <- "https://yomomma-api.herokuapp.com/jokes"
ym_request <- GET(
url = ym_path,
query = list(count = 10) # the number of jokes to get
)
ym_requestResponse [https://yomomma-api.herokuapp.com/jokes?count=10]
Date: 2021-06-07 14:26
Status: 200
Content-Type: application/json
Size: 788 B
We can check whether our request worked just to be sure.
http_status(ym_request)$category
[1] "Success"
$reason
[1] "OK"
$message
[1] "Success: (200) OK"
That’s what we want to see. Now we can extract the content.
ym_content <- content(ym_request, as = "text", encoding = "UTF-8")
ym_content[1] "[{\"joke\":\"Yo momma is so fat that she cangt even fit into an AOL chat room.\"},{\"joke\":\"Yo momma is so stupid that she sold the house to pay the mortgage.\"},{\"joke\":\"Yo momma's so ugly, even Tamaki wouldn't hit on her.\"},{\"joke\":\"Yo momma is so fat that her waist size is the Equator.\"},{\"joke\":\"Yo momma's breath smells so bad that when she yawns her teeth duck out of the way.\"},{\"joke\":\"Yo momma is so fat that even Chuck Norris couldn't run around her.\"},{\"joke\":\"Yo momma's so ugly, she's the real reason sasuke left the village.\"},{\"joke\":\"Yo momma's so fat that she expresses her weight in scientific notation.\"},{\"joke\":\"Yo momma's so stupid that whenever someone rings the doorbell, she checks the microwave.\"},{\"joke\":\"Yo momma is so dirty that she loses weight in the shower.\"}]"
This isn’t in a nice format, and this is because the content is in JSON, which stands for JavaScript Object Notation. JSON is useful because it is easily readable by a computer, and for this reason, it has become the primary way that data is communicated through APIs. Most APIs will send their responses in JSON format.
This is where the {jsonlite} package comes in, since it contains useful funcitons for converting JSON code into more familiar data objects in R.
# flatten tells it to create a single unnested dataframe
ym_jokes_df <- fromJSON(ym_content, flatten = TRUE) %>%
data.frame()
ym_jokes_dfThat’s all there is to it! In reality things are not always that simple, and for access to many APIs you will need to register an application with the website. I’ve included a few more examples below.
More examples
Cat facts
You can scrape list of facts about cats here: https://alexwohlbruck.github.io/cat-facts/docs/
cat_path <- "https://cat-fact.herokuapp.com/facts"
cat_facts <- GET(
url = cat_path,
)
http_status(cat_facts)cat_df <- content(cat_facts, as = "text", encoding = "UTF-8") %>%
fromJSON(flatten = TRUE) %>%
data.frame()
cat_df %>%
select(text)Articles in the Guardian
The Guardian has a free open API for anyone to use. All you need to do is register for a developer app here: https://open-platform.theguardian.com/documentation/
Once you do you will be sent your API key and get a URL to use that will look like this.
# the XXXXXXXXXXXXXXXXXX will be your API key
"https://content.guardianapis.com/search?api-key=XXXXXXXXXXXXXXXXXXXXXXXXXXX"Alternatively, you can save your key and then load it when you need it. Your key should not be shared (which is why I don’t include it here).
# read in my saved key and paste it to the path
gd_api <- readRDS(here::here("keys", "guardian_keys.rds"))
gd_path <- paste0("https://content.guardianapis.com/search?api-key=",
gd_api$api_key)
# Loaded
gd_request <- GET(
url = gd_path,
query = list(
q = "dinosaur" # pieces mentioning dinosaurs
)
)
# Check status
http_status(gd_request)$category
[1] "Success"
$reason
[1] "OK"
$message
[1] "Success: (200) OK"
Get the content and parse the JSON code. Different sites provide different information, so you need to check it.
gd_content <- content(gd_request, as = "text", encoding = "UTF-8") %>%
fromJSON(flatten = TRUE) %>%
data.frame()
# What's in there?
names(gd_content) [1] "response.status" "response.userTier"
[3] "response.total" "response.startIndex"
[5] "response.pageSize" "response.currentPage"
[7] "response.pages" "response.orderBy"
[9] "response.results.id" "response.results.type"
[11] "response.results.sectionId" "response.results.sectionName"
[13] "response.results.webPublicationDate" "response.results.webTitle"
[15] "response.results.webUrl" "response.results.apiUrl"
[17] "response.results.isHosted" "response.results.pillarId"
[19] "response.results.pillarName"
Look at the date and titles
gd_content %>%
select(response.results.webPublicationDate, response.results.webTitle)Entries in the Oxford English Dictionary
Again, you’ll need to register a developer account (https://developer.oxforddictionaries.com/), and you can get a free version (with limits) easily. In this case, you’ll need both your app ID and your app key, and we’ll include these in the add_headers() argument for the GET() request. I figured this out by looking at the little example of the Python code at the bottom of the developer page (they don’t have an R example, but the two work very similarly).
Information about how to create your GET request from the OED API
oed_keys <- readRDS(here::here("keys", "oed_keys.rds"))
# note that the endpoint path contains the word you are searching for here
word <- "dinosaur"
ox_path <- paste0("https://od-api.oxforddictionaries.com/api/v2/entries/en-gb/", word)
ox_request <- GET(
url = ox_path,
add_headers(
app_id = oed_keys$app_id,
app_key = oed_keys$app_key
))
http_status(ox_request)$category
[1] "Success"
$reason
[1] "OK"
$message
[1] "Success: (200) OK"
It works! Now we could create our own wrapper function that gets the request and parses it all in one go:
# function for getting entries from the OED API
get_OED_entry <- function(word, lang = "en-gb"){
# Load the keys if not already in the workspace
if(!"oed_keys" %in% ls()) oed_keys <- readRDS(here::here("keys", "oed_keys.rds"))
path <- paste(
"https://od-api.oxforddictionaries.com/api/v2/entries",
lang,
tolower(word),
sep = "/"
)
ox_request <- GET(
url = path,
add_headers(
app_id = oed_keys$app_id,
app_key = oed_keys$app_key
))
if(ox_request$status_code != 200){
http_status(ox_request) %>%
print()
} else {
ox_request %>%
content(as = "text", encoding = "UTF-8") %>%
fromJSON(flatten = TRUE) %>%
data.frame()
}
}kraken_entry <- get_OED_entry("kraken")
kraken_entry %>%
glimpse()Rows: 1
Columns: 10
$ id <chr> "kraken"
$ metadata.operation <chr> "retrieve"
$ metadata.provider <chr> "Oxford University Press"
$ metadata.schema <chr> "RetrieveEntry"
$ results.id <chr> "kraken"
$ results.language <chr> "en-gb"
$ results.lexicalEntries <list> [<data.frame[1 x 5]>]
$ results.type <chr> "headword"
$ results.word <chr> "kraken"
$ word <chr> "kraken"
Notice that the object returned by the API is a complex list with multiple embedded lists and dataframes, so you’ll have to do some exploration.
kraken_entry$results.lexicalEntries %>%
glimpse()List of 1
$ :'data.frame': 1 obs. of 5 variables:
..$ entries :List of 1
.. ..$ :'data.frame': 1 obs. of 3 variables:
..$ language : chr "en-gb"
..$ text : chr "kraken"
..$ lexicalCategory.id : chr "noun"
..$ lexicalCategory.text: chr "Noun"
This is an unnamed list whose first argument is a dataframe of entries. We can see what this looks like:
kraken_entry$results.lexicalEntries %>%
first() %>% # pull the first item of a list
glimpse()Rows: 1
Columns: 5
$ entries <list> [<data.frame[1 x 3]>]
$ language <chr> "en-gb"
$ text <chr> "kraken"
$ lexicalCategory.id <chr> "noun"
$ lexicalCategory.text <chr> "Noun"
So the entries are a list with one dataframe as its argument (this is getting ridiculous…). Let’s see what that looks like.
kraken_entry$results.lexicalEntries %>%
first() %>%
pull(entries) %>% # pull the content of a data.frame column
first() %>%
glimpse()Rows: 1
Columns: 3
$ etymologies <list> "Norwegian"
$ pronunciations <list> [<data.frame[1 x 4]>]
$ senses <list> [<data.frame[1 x 5]>]
Oh good grief! This seems crazy, but it actually makes some sense, as there is a lot of information in a dictionary entry, and a complex object like this is not a bad way to keep it organised. Once we know the structure, it would be rather simple to create functions to get it quicky.
# get our definition
kraken_entry$results.lexicalEntries %>%
first() %>%
pull(entries) %>% # pull the content of a data.frame column
first() %>%
pull(senses) %>%
first() %>%
pull(definitions) %>%
simplify() # collapse a list to a vector[1] "an enormous mythical sea monster said to appear off the coast of Norway."
So we have a process for getting definitions. We can test it on a form with multiple meanings.
bank_entry <- get_OED_entry("bank")
bank_entry$results.lexicalEntries %>%
first() %>%
pull(entries) %>% # pull the content of a data.frame column
first() %>%
pull(senses) %>%
first() %>%
pull(definitions) %>%
simplify()[1] "the land alongside or sloping down to a river or lake"
[2] "a long, high mass or mound of a particular substance"
[3] "a set of similar things, especially electrical or electronic devices, grouped together in rows"
[4] "the cushion of a pool table"
Nice. You can imagine creating functions that get definitions (or pronunciations, etymologies, etc) from entry objects very easily.